ABSTRACT

Test construction is not the strictly logical process
that we might wish it to be. This is particularly true in a large
on-going project such as the National Assessment of Educational
Progress (NAEP). Most of the really deep questions can only be
answered by the exercise of well-informed human judgment.
Criterion-referenced testing is still a term in search of definition.
It has been suggested that NAEP's exercises might be more properly
called "objective referenced" tests. That is a reasonable title for
our efforts since we are attempting to assess the degree of
achievement of stated goals without reference to a predetermined
level or criterion. Whatever the appropriate title may be, we share
the concerns of all workers in the field with the same basic
questions. But until satisfactory scientific solutions have been
found, we, like the rest of education, must rely on the best human
judgment available. (Author)
A HUMANISTIC APPROACH TO
CRITERION REFERENCED TESTING

An adequate theory in science serves as a foundation for practical
applications as well as a framework for experimentation. Intellectual
activities that lack such a solid theoretical basis might be realistically
considered arts rather than sciences. Much of education is art in that
activities are based on the intuition and judgment of practitioners rather
than logical extensions of quantifiable theory. The field of educational
measurement is an exception since an extensive body of theory has been
developed that guides the activities of norm-referenced testing. However,
a much older tradition exists in educational measurement that attempts
to determine the absolute achievement of the individual or population
without regard to interpersonal comparisons. That tradition, which is
currently called criterion referenced testing, can call on much of the
statistical technique that is used in other fields of measurement. It is
faced, however, with important problems in basic theory that norm-
referenced testing can, by definition, safely ignore.

Activity in criterion referenced testing, like the rest of education,
cannot be delayed until basic applicable theory is developed. Schools
cannot close their doors until a comprehensive theory of learning is
found. Neither can assessment activities be halted. We must, instead,
rely on human judgment to solve practical problems while we work on
basic theory. This is the situation currently faced by the National
Assessment of Educational Progress (NAEP).



A brief overview of the history and purposes of the National
Assessment might be useful as background for a discussion of NAEP's
responses to important theoretical questions in the area of criterion
referenced testing.²

By the early 1960's many billions of dollars were being
invested annually in the formal education of our young people.
The only available measures of educational quality resulting
from this investment had been based upon inputs into the
educational system such as teacher-student ratios, number
of classrooms, and number of dollars spent per student. The
tenuous assumption had been that the quality of educational
outcomes--what students actually learn--was directly
related to the quality of the inputs into the educational system.
No significant direct assessment of educational outcomes had
been made. The typical state-administered or school-administered
achievement tests, which provided scores whereby
one student could be compared with others, were useful for
categorizing students; but they provided very little information
about what students were actually learning.

This insufficiency of information became the concern of
Francis Keppel, United States Commissioner of Education
(1962-1965), who initiated a series of conferences to find
ways in which it might be overcome. In 1964, as a result of
these conferences, John W. Gardner, president of the
Carnegie Corporation, asked a distinguished group of educators
and lay persons to form the Exploratory Committee
on Assessing the Progress of Education (ECAPE). This
committee, chaired by Dr. Ralph W. Tyler, was to examine
the possibility of conducting an assessment of educational
attainments on a national basis.

After much study, ECAPE deemed that it was feasible to
assess the knowledges, understandings, skills, and attitudes
in 10 subject areas¹ at four age levels (9, 13, 17, and
adult--ages 26-35). The project began its first assessment
of the subject areas Science, Citizenship, and Writing in the
Spring of 1969. Later that same year, the project came
under the auspices of the Education Commission of the States
and was named the National Assessment of Educational
Progress (NAEP).

¹Art, Career and Occupational Development, Citizenship,
Literature, Mathematics, Music, Reading, Science,
Social Studies, and Writing.
For the first time, there would be a direct measure of
educational outcomes which could be utilized by school systems
to improve the educational process. Since NAEP is to
be an ongoing project, it will eventually be able to assess
changes in these knowledges, understandings, skills, and
attitudes to determine any changes in educational outcomes.

Many people prominent in education and measurement have
contributed heavily to the purposes and processes of NAEP. A brief and
very incomplete roster would include, besides Tyler, Keppel and Gardner,
Jack Merwin, Frank Womer, Stanley Ahmann, John Tukey, Frederick
Mosteller, and Lee Cronbach.

Two subject areas are currently being assessed each year with a
five year cycle for reassessment within a given subject area. The five
year assessment-reassessment cycle and the 210 minutes allotted to
each subject area at each age level in an assessment year place very
practical constraints on the design and production of exercises (test items).
The five year cycle requires continuous exercise development effort and
limits experimental and validation activities. The time allotment limits
the number of exercises administered and, hence, the depth of coverage
for each objective.

²This section was adopted from What is National Assessment by Dr. Frank
Womer and The National Assessment Approach to Exercise Development
by Drs. Carmen J. Finley and Frances S. Berdie and may be obtained from
National Assessment of Educational Progress, Public Information Department,
300 Lincoln Tower, 1860 Lincoln Street, Denver, Colorado 80203.

Universe Definition

Some of the most intriguing questions in the field of criterion
referenced measurement have to do with the rigorous definition of a
domain of reference (subject matter) and of a universe of behaviors
within that domain. This paper will briefly summarize some of those
questions and indicate the general thrust of NAEP's responses. The
responses discussed in this paper are to be viewed only as current
positions of NAEP regarding the basic problems. They are in no sense
offered as definitive solutions.

Two questions must be asked:

1. What constitutes a definition of a domain of reference or
a universe of behaviors?

2. When can we be sure that a complete definition is achieved?

Since the problems of defining a domain of reference and a universe
of behaviors are parallel, discussion of a domain of reference can
serve as a model for discussion of a universe of behaviors.

It is clear that a complete definition of a domain of reference
must include all knowledge, skills and attitudes directly related to
the subject area and exclude all those that are not related. A similar
statement could be made for defining a universe of behaviors by
substituting "behaviors" for "knowledge, skills and attitudes." Such a
definition need not be an enumeration. Indeed, such an enumeration
would be useless because of its extensive, if not infinite, length.

What is needed then is a method of statement generation that
will produce relevant and only relevant statements. We can be sure
that a complete definition is achieved only when it can be logically
shown that any statement or question that can be made by our statement
generation mechanism is or is not a member of the set of questions and
statements contained in that domain or universe.

Lacking a logically complete knowledge generator, it is not possible
to make statistically defensible and generalizable statements relating
individual or group performance to a subject area by means of a restricted
set of items. Without a complete definition of the domain of reference
and a universe of behaviors, all statements about the results of a
criterion referenced test must be confined to the items in that test without
further generalizations. Clearly, this is not the purpose of any test maker.

Several approaches to the problem of generalizability can be found
in the literature. One approach is to ignore the problem altogether.
Another is to indicate how certain domains and universes can be defined
and systematically sampled. Unfortunately, those domains and universes
that have been discussed are typically narrowly restrictive or
trivial or both. For example, tests of knowledge of word meanings
can be constructed by defining the domain of reference as the Merriam-
Webster Collegiate Dictionary, 7th Edition. All statements about words
contained in that dictionary are relevant and all statements not contained
in that dictionary are not relevant. One can then define the universe
of behaviors as responses to a cloze test on the definitional entry for
each word. Many schemes can then be devised for systematically
sampling both the domain of reference and the universe of behaviors.
Item generation rules can be devised which will produce any number of
equivalent tests and the results of those tests can indeed be generalized
to knowledge of word meanings as defined in the domain of reference.
Such schemes are of little value, however, in constructing tests to
assess knowledge, skills and attitudes in broader areas such as social
studies, literature, music or art.
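
The dictionary example can be made concrete with a short sketch. It is
only an illustration under stated assumptions: the four-entry word list,
the rule of deleting every fifth word, and the function names are all
hypothetical and are not drawn from any NAEP procedure. The point is
simply that a finite, enumerable domain permits both random sampling of
entries and mechanical item generation.

```python
import random

# Hypothetical miniature "dictionary" standing in for the domain of reference.
TOY_DICTIONARY = {
    "timbre": "the quality of a musical sound that distinguishes one instrument from another",
    "foil": "an incorrect answer choice offered in a multiple choice exercise",
    "cloze": "a test in which words are deleted from a passage and must be supplied",
    "norm": "a standard of performance derived from a reference group",
}


def make_cloze_item(word, definition, every_nth=5):
    """Blank out every nth word of the definition; return (prompt, answers)."""
    tokens = definition.split()
    answers = []
    for i in range(every_nth - 1, len(tokens), every_nth):
        answers.append(tokens[i])
        tokens[i] = "_____"
    return f"{word}: " + " ".join(tokens), answers


def sample_test(dictionary, n_items, seed):
    """Draw a simple random sample of entries and build one cloze item per entry."""
    rng = random.Random(seed)
    # Sort the keys so the sample depends only on the seed, not on dict ordering.
    words = rng.sample(sorted(dictionary), n_items)
    return [make_cloze_item(w, dictionary[w]) for w in words]


if __name__ == "__main__":
    for prompt, answers in sample_test(TOY_DICTIONARY, n_items=2, seed=1):
        print(prompt, answers)
```

Because the sample is driven by a seed, the same seed reproduces a test
exactly, while different seeds generally yield different tests built by
the same rule. No comparable enumeration exists for a domain such as
social studies or literature.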

Objectives

It is clearly beyond the current state of the art to define the
universe of discourse for a complex area in the strict sense discussed
above. Yet it is equally clear that a set of exercises (test items) which
form a coherent assessment of a subject area cannot be constructed
without some definition of the domain to be tested. Faced with this
conundrum, NAEP has taken a humanistic rather than a statistical
approach to universe definition.

The term "humanistic" is used to indicate reliance on human
judgment rather than logical or statistical proof. We define our
universe by producing a set of objectives that represent a consensus of
opinion covering many segments of our society regarding the important
goals and outcomes of our educational processes in respect to a given
subject area.

The question might well be raised, "Why add yet another
formulation of educational goals and objectives to the already existing
plethora of such documents?" It is certainly a reasonable question
and yet one that is easily answered in terms of NAEP's mission. NAEP,
as its name states, is a national assessment and as such is compelled
to attend to those aspects of education whose definition and evaluation
can be agreed upon for the society as a whole. Most of the myriad
statements of objectives are produced by and for the use of schools at
the local and state level. NAEP must go beyond that restricted
viewpoint to identify goals that are accepted nationally.

Since NAEP is also an assessment of change in educational
outcomes over time, we have the further responsibility to examine and
revise our codifications of objectives on a systematic cyclical basis.
These twin requirements of demonstrable national significance and
continuous revision justify the effort to produce statements of goals
and objectives that are unique to our own needs and purposes.

NAEP defines the domain of reference in a subject area by
arriving at a national consensus statement of goals in that area. Goals
are stated in the form of overall objectives with attendant levels of
sub-objectives. The form and structure of the objectives varies from
one subject area to another and between assessment cycles within a
single subject area. For example, a major objective and its sub-
objectives for cycle 1 of Music were stated as follows:

III. LISTEN TO MUSIC WITH UNDERSTANDING.

A. Perceive the various elements of music, such as timbre, rhythm,
melody and harmony, and texture.

1. Identify timbres.

Age 9           Identify by categories the manner in which the
                instrument is played (e.g., struck, bowed).
                Identify individual instrumental timbres--
                unaccompanied.
                Identify individual instrumental timbres--
                with accompaniment.

Age 13          (in addition to Age 9)
                Identify individual vocal timbres--with
                accompaniment.
                Identify ensemble timbres, instrumental and
                vocal.

Age 17, Adults  Identify by categories families of related
                timbres (e.g., woodwinds, plucked strings).
                Identify individual instrumental timbres--
                unaccompanied.
                Identify individual instrumental and vocal
                timbres--with accompaniment.
                Identify ensemble timbres, instrumental and
                vocal.

A much more loosely defined objectives structure was produced for the
first cycle of Literature assessment as shown by the following example:

III. DEVELOP A CONTINUING INTEREST AND PARTICIPATION
IN LITERATURE AND THE LITERARY EXPERIENCE

This goal is directed at assessing the interests and attitudes; for the most
part the goal is relevant to Age 17 and Adult.

A. Be intellectually oriented to literature.

This goal asks of the individual a recognition of the importance of
literature to the individual and society, and a recognition that literary
expression requires a number of forms to enable it to become an art.

All ages        Recognize the importance of literature to an
                understanding of cultures distant in time or distinct
                in history.
                Recognize the importance of literature to a
                comprehension of the diversity and homogeneity of man.
                Recognize that participating in the literary
                experience is a prime form of enjoyment.

Adults          ... free society.
                Recognize that the art of literature involves a close
                connection between form and content.

The process of identifying and explicating objectives or revising
those used in the previous cycle of assessment of a subject area is
somewhat complex and occupies a time span of approximately nine months. A
search of recent literature is made to identify new trends in the subject
area. The literature search is coupled with an examination of existing
sets of written objectives such as those brought together by The
Instructional Objectives Exchange. This material forms a background for a
number of working and review panels that produce and refine the
objectives to be used as the basis for exercise development and for reporting
of assessment results.

In the early years of NAEP, objectives development was done by
sub-contractors (AIR, ETS, SRA, etc.). They studied the literature,
examined existing objectives and produced a document that was
critiqued by a variety of consultants and then revised. This plan was
followed for the objectives development of most of the first cycle
assessments. Leaving objectives development in the hands of the contractors
who then wrote the exercises not only produced objectives of uneven
quality but also was fraught with the danger of producing only those
objectives that were most easily measured and neglecting those that
might be at least as important to the education community but are
difficult to measure.



With these considerations in mind, the task of producing objectives
was removed from the purview of sub-contractors and made part of the
direct responsibility of the Exercise Development department of NAEP.
A standardized procedure for developing objectives is now followed that
begins with a mail review by subject matter experts of the objectives
from the previous cycle. This mail review is followed by a conference
in which consultants determine the broad outlines of the desired revision.
A sub-set of consultants from the first review conference produce a
first draft of the revised objectives within the guidelines from the
conference. This draft is reviewed by mail by members of the first conference
and a second draft is produced based on the resulting comments. The
second draft is then reviewed by a second conference of consultants, some
of whom were present at the first revision conference. Consensus is
reached on the remaining points at issue among the consultants and the
document is adjusted accordingly and given a final editorial polish.

The working and review panels are composed of consultants drawn
from three major groups: scholars, educators within the subject
area, and qualified and interested laymen. Between 35 and 50 consultants
are involved at one time and another in the development of objectives.
Consultants are chosen with serious attention to representation by region
(northeast, southeast, central and west), type of institution (university,
four year college, junior college, secondary and elementary schools and
private schools), race and sex. Wherever clearly defined schools of
thought hold differing positions in a subject area, care is given to assure
representation of each of the conflicting points of view. The above
describes the selection of consultants who serve actively on panels. In
the case of mail reviews, a much larger number of people are involved.

The method just described, while far from ideal, does produce a set
of objectives that represent as nearly as possible, within the constraints
of time and money, a national consensus on educational goals and
objectives that are currently valued by our society. Great emphasis is placed
on producing objectives that are important without regard to their
measurability. NAEP views the objectives as defining the broad domain within
which exercises are to be written and as a mandate from our society
to produce data on related educational outcomes.

There are a number of important questions still unresolved in the
area of objective development. The question with the deepest theoretical
implications is, "To what depth of sub-objective level and of age
specific behavior should objectives be taken?" The major objectives are
generally few in number and are of such a general nature that they
provide only an ambiguous guide to exercise development. At each level of
sub-objective, the domain of reference is more clearly defined but how
clear that definition can be or should be is still an open question. There
is currently a large variation in this matter from subject area to subject
area and between assessment cycles within any given subject area. The
use of age specific behaviors in the objectives furnishes the clearest
definition and guide for exercise development. However, it is again a
question of viewing age specific behaviors as an exhaustive list of all
possible behaviors (an obviously impossible task) or simply as guidelines
and illustrations for the exercise developers.

A second and related question has to do with the feasibility of
developing some sort of hierarchical scheme of cognitive and affective
objectives. Many such schemes have been devised, but is it possible or
even advisable to choose one plan to the exclusion of all others?

A final question has to do with standardizing the format of
objectives. It has been suggested that from a quality control standpoint, a
standardized format and framework of objectives should be developed
and applied to all subject areas. There is no solid agreement, however,
that this plan, if it could be implemented, would be desirable. The
discussion on this point revolves around the issue of the amount of
freedom allowed to the developers of the objectives to express in their
own way those aspects of the subject area that they feel to be most
important in our educational scheme.

Item Generation Rules 

In criterion referenced testing it would be desirable to identify a
generally acceptable method for item construction. In the strict sense,
such a method should provide a systematic sampling of a previously
defined universe of behaviors. Further, it should be a set of rules
which, if followed by more than one person or group of item writers
with equivalent knowledge, would produce equivalent tests. We have
already discussed the difficulties involved in domain and universe
definition in the complex areas of interest to NAEP. Since the universe of
behaviors has not been well defined, a systematic sampling scheme is
difficult to devise. When the notion that a set of rules may be clearly
enough stated that equivalent tests may be generated from them is
examined closely, it is easily seen that such rules, while useful in narrowly
specialized areas, are not definable in other more complex areas. Tests
of arithmetic computational skills, tests of word meanings and spelling
tests have been constructed using such rule sets. Indeed, on occasion,
rules have been embodied in computer programs which will generate
equivalent tests ad infinitum. Unfortunately, such tests, while complete
in themselves, fall far short of being comprehensive tests of
mathematics, reading or writing.
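
A rule set of the kind mentioned for arithmetic computation might look
like the following sketch. The single rule used here (two-addend
addition with addends drawn uniformly from a fixed range) and all names
are assumptions made for illustration; they do not describe any program
actually used by NAEP or its contractors.

```python
import random


def generate_addition_test(n_items, seed, max_addend=99):
    """Build one test form by applying a single fixed item generation rule.

    Rule (hypothetical): each item is "a + b = ?" with a and b drawn
    uniformly from 0..max_addend. Any form produced by this rule is
    equivalent to any other in the sense that both sample the same universe.
    """
    rng = random.Random(seed)
    items = []
    for _ in range(n_items):
        a = rng.randint(0, max_addend)
        b = rng.randint(0, max_addend)
        items.append({"prompt": f"{a} + {b} = ?", "key": a + b})
    return items


if __name__ == "__main__":
    # Two forms from different seeds: different items, same rule.
    form_a = generate_addition_test(5, seed=1)
    form_b = generate_addition_test(5, seed=2)
    for item in form_a + form_b:
        print(item["prompt"], item["key"])
```

The sketch also shows the limitation noted in the text: the rule fully
specifies a test of two-addend addition, but nothing in it extends to
mathematics as a whole.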

Assuming for a moment that solutions were at hand for problems
of defining the universe of behaviors and of stipulating an adequate set
of rules for generating items, we are still faced with a question of
serious theoretical consequences. The question might be phrased as, "How
much is enough?" How many items are necessary for an acceptable test
of an objective? If the objectives are complete through the identification
of one or more levels of sub-objectives under each major objective, and
if each sub-objective is adequately tested, then we can certainly claim
that we have an adequate test of a major objective. However, such a
plan simply puts off the problem to another level of detail. We are still
faced with the central question of how many items are necessary to test
the lowest level sub-objective or any given age specific behavior.



NAEP Exercise Development

In light of the problems outlined above, we may move to a brief
discussion of the methods used by NAEP in generating exercises (test
items). None of the activities to be described below are presented as
final solutions, but it will be seen that many of our item generating
activities, while perhaps tangential to the central problems as stated
above, do stem from our abiding concern for such problems. Again,
as in the definition of domains of reference and universes of behavior,
it will also be seen that we continue to use a humanistic approach in
the sense of relying primarily on the judgment of experts in the subject
matter area.

Following the development of objectives, contracts are awarded
through competitive bidding for the generation of exercises to assess
those objectives. The amount of exercise material to be developed for
each sub-objective is based on a "weighting" scheme. Weights are
assigned by subject matter experts who are experienced with students
at the four age levels. For example, the major objectives are weighted
for their relative importance for nine-year-olds by teachers who have
experience with that age group. Each sub-objective is then weighted
for its relative importance within the major objective. This scheme is
continued to the lowest level of sub-objective. The weights for an
objective may differ widely over age groups, reflecting the importance of that
objective at one age as opposed to another.




The use of weights is in some sense a response to the problem of
providing adequate coverage for each sub-objective. Since the weight of
the sub-objective is an index of its importance in relation to other
objectives at a given age level, such weights can easily be translated into
percentages of the total assessment time that it would be reasonable to
spend in assessing that particular sub-objective. This method of
specifying coverage of course accounts only for amount of material
related to its importance and does not speak to the issue of relating
coverage to the complexity of the various objectives and sub-objectives.
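
The translation from weights to assessment time can be sketched as a
simple proportional allocation. The 210 minute total comes from the
text; the objective labels and the weights themselves are hypothetical
values chosen for illustration, and the two-level structure is an
assumption (the scheme is described as continuing to the lowest level of
sub-objective).

```python
TOTAL_MINUTES = 210  # time allotted to a subject area at each age level

# Hypothetical weights for one age level. Each major objective carries a
# weight relative to the other major objectives; each sub-objective carries
# a weight relative to its siblings within the same major objective.
objectives = {
    "I": {"weight": 2, "subs": {"A": 1, "B": 1}},
    "II": {"weight": 3, "subs": {"A": 2, "B": 1}},
    "III": {"weight": 5, "subs": {"A": 3, "B": 2}},
}


def allocate_minutes(objectives, total_minutes):
    """Split total_minutes across sub-objectives in proportion to nested weights."""
    total_weight = sum(obj["weight"] for obj in objectives.values())
    allocation = {}
    for name, obj in objectives.items():
        objective_minutes = total_minutes * obj["weight"] / total_weight
        sub_total = sum(obj["subs"].values())
        for sub_name, sub_weight in obj["subs"].items():
            allocation[f"{name}.{sub_name}"] = objective_minutes * sub_weight / sub_total
    return allocation


if __name__ == "__main__":
    for label, minutes in allocate_minutes(objectives, TOTAL_MINUTES).items():
        print(f"{label}: {minutes:.1f} minutes ({100 * minutes / TOTAL_MINUTES:.0f}%)")
```

As the text observes, an allocation of this kind reflects judged
importance only; it says nothing about how many minutes a complex
sub-objective actually needs for adequate coverage.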

Much attention has been paid by NAEP to the problem of giving
contractors an adequate framework for preparing the kinds of exercises
that will achieve coverage through a variety of approaches. We have
arrived at a general notion of exercise prototypes which are not rules
for exercise generation, nor are they examples of specific exercises,
but rather attend to those aspects and variables of exercise generation
that can be discussed. NAEP exercise prototypes are actually a tree
structure showing mutually exclusive categories for four variables:
administration mode, stimulus mode, response mode and response
category. The administration mode is dichotomous: an exercise can
be administered either individually or to a group. Branching from
administration mode we define the stimulus mode as audio, visual, other
senses (tactual, olfactory, etc.) or some combination of the three. From
each stimulus mode we show a dichotomy of response alternatives or
response mode: objective (multiple choice) and free response. Finally,
branching from each response mode we define response categories as
written, verbal, role playing, group interaction, and other physical
action.

Such a tree structure results in 80 (2x4x2x5) possible
prototypes. It is clear that not all possible prototypes are applicable to
any given subject area. A panel of subject matter experts selects those
prototypes that are most reasonable for assessing a subject area. Their
input, in conjunction with practical considerations of cost of
administration and scoring, provides the specification of percentage ranges
(minimum and maximum) in terms of minutes of material as guidelines
for the contractor. The subject matter experts also produce exemplary
exercises within the subject area for each prototype specified. The
use of prototypes as a control for coverage through a variety of approaches
is frankly experimental. Its first use will be in the current
redevelopment of literature assessment, but it is expected to provide a more
balanced body of exercises.
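
The count of 80 prototypes follows directly from crossing the four
variables, as the sketch below shows. The category labels are taken from
the description above; the final filter is a purely hypothetical
stand-in for the expert panel's selection and does not represent any
actual NAEP choice.

```python
from itertools import product

ADMINISTRATION_MODES = ["individual", "group"]
STIMULUS_MODES = ["audio", "visual", "other senses", "combination"]
RESPONSE_MODES = ["objective (multiple choice)", "free response"]
RESPONSE_CATEGORIES = [
    "written",
    "verbal",
    "role playing",
    "group interaction",
    "other physical action",
]

# Every root-to-leaf path of the tree is one prototype.
prototypes = list(
    product(ADMINISTRATION_MODES, STIMULUS_MODES, RESPONSE_MODES, RESPONSE_CATEGORIES)
)

# Hypothetical panel decision: keep only prototypes with a written response.
written_only = [p for p in prototypes if p[3] == "written"]

if __name__ == "__main__":
    print(len(prototypes))    # 2 x 4 x 2 x 5 = 80
    print(len(written_only))  # 2 x 4 x 2 = 16
```

In practice the selection rests on expert judgment about which
combinations are sensible for a subject area, not on a mechanical filter
of this kind.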

Working within the weighted objectives, prototypes, and exemplary
exercises, contractors produce the specified minutes of exercise
material for assessment of a subject area. Each exercise produced by the
contractor must be accompanied by a rationale relating that exercise
to the sub-objective that it is purporting to measure. It must also be
accompanied by a rationale relating that exercise to other exercises
within the body of material to be used in the assessment.



The exercises received from the contractor are subjected to at
least four reviews by each of three groups: the NAEP staff, subject matter
experts (scholars and educators) and qualified laymen. In addition to
reviewing the exercise itself, the rationale relating that exercise to a
sub-objective and to other exercises in the body of material is also brought
under scrutiny. Some exercises survive each review session; others
are sent back to the contractor for suggested revisions; and others,
hopefully a small percentage, are rejected as being without merit and
are no longer considered for use in the assessment.

Those exercises that have survived the reviews, either in their
original or revised state, are then given a full field trial. Each
exercise has been tried out during its developmental stages by the contractor
and is submitted to NAEP accompanied by data from three sub-units of
the population: extreme inner city, extreme rural, and affluent suburb.
Data from the developmental tryouts consist of timing information,
overall percentage correct responses, percentages of responses for
each foil in a multiple choice exercise, and the beginnings of a scoring
guide or response categorization in the case of free response exercises.
While these data are gathered from three sub-units of the population,
the number of subjects contributing from each population is
necessarily small. For increased reliability of this sort of data, we run
extensive field trials on a national sample. The field trials, while far
less extensive than the actual assessment, are large enough to yield
reliable data and also point up regional biases and administrative
problems that might otherwise be missed.

Following the field trials, the pool of exercises is reviewed by
the United States Office of Education for possible offensiveness in
sensitive areas. Exercises surviving this last review by USOE are
then examined by successive panels of subject matter experts in a
selection conference.

Since the attrition rate through all the reviews is unpredictable
in any precise way, we order from the contractor a considerable
overage of material. This overage is on the order of 100% plus an
additional 20% that allows for contractor creativity outside of the
specifications and guidelines furnished by NAEP.

Since we are constrained to a total of 210 minutes of
assessment for each subject area, a selection conference is necessary to
choose the best among surviving exercises. Consultants at the
selection conference are required to pay close attention to maintaining
the balance over objectives and sub-objectives that was specified in
the original contract and to the relationships between exercises
that form a coherent assessment.

Validity 

Two main concerns of NAEP for the assessment exercises
are their content validity and their importance. Two questions
are continually asked at every exercise review conference: "Is this
a valid measure of the objective for which it was written?" and "If
it is valid, is it an important or a trivial measure of the
objective?" Importance can only be established by following the judgment
of subject matter experts. Human judgment is also the primary check
on validity.

However, another measure of validity is available for some of
the exercises by examining the assessment response data. If an item
is administered to two groups, one of which has had no training or
experience in the area and the other has had extensive training, the
results can be viewed as one measure of the item's validity. In the
ideal case, a valid item would yield a score near zero for the untrained
group and approaching 100% correct for the highly trained group. Such
a test is approximated for those NAEP exercises that overlap age
groups. It may be assumed that seventeen-year-olds have had more
training in a given subject area than thirteen-year-olds when training
in that area is a continuous process. The same assumption may be
made for comparison of thirteen-year-olds and nine-year-olds. If
the same exercise is administered to the three age levels, an increasing
percentage of correct responses from nine- to seventeen-year-olds can
be accepted as some assurance of the item's validity. In general, such
has been the case with NAEP data. If in the field trials a contrary
instance is found, that item is examined closely. If an adequate
explanation is not evident, the item is dropped from the assessment.
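
The age-overlap check described above amounts to screening each item for
a drop in percentage correct between a younger and an older group. The
following is a minimal sketch under stated assumptions: the exercise
identifiers and percentages are invented for illustration, and the
decision to flag only an actual decrease (not a tie) is a choice made
here, not a documented NAEP rule.

```python
def flag_contrary_items(percent_correct_by_age):
    """Return ids of items whose percent correct drops from a younger
    age group to the next older group that took the same item."""
    flagged = []
    for item_id, by_age in percent_correct_by_age.items():
        ages = sorted(by_age)
        scores = [by_age[age] for age in ages]
        if any(older < younger for younger, older in zip(scores, scores[1:])):
            flagged.append(item_id)
    return flagged


# Hypothetical field-trial results: percent correct keyed by age level.
field_trial = {
    "EX-01": {9: 22.0, 13: 47.0, 17: 71.0},  # rises with age
    "EX-02": {9: 55.0, 13: 41.0, 17: 63.0},  # dips at age 13
    "EX-03": {13: 30.0, 17: 30.0},           # flat; not flagged under this rule
}

if __name__ == "__main__":
    print(flag_contrary_items(field_trial))
```

A flagged item is only a candidate for closer examination. Whether an
adequate explanation exists, and whether the item is dropped, remains a
matter of human judgment.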

Summary

Test construction is not the strictly logical process that we might
wish it to be. This is particularly true in a large on-going project such
as NAEP. Most of the really deep questions can only be answered by
the exercise of well informed human judgment. Criterion referenced
testing is still a term in search of a definition. It has been suggested
that NAEP's exercises might be more properly called "objective
referenced" tests. That is a reasonable title for our efforts since we
are attempting to assess the degree of achievement of stated goals
without reference to a predetermined level or criterion. Whatever the
appropriate title may be, we share the concerns of all workers in the
field with the same basic questions. But until satisfactory scientific
solutions have been found, we, like the rest of education, must rely on
the best human judgment available.